Forest-based Algorithms in Natural Language Processing
Authors
Liang Huang (supervisors: Aravind K. Joshi and Kevin Knight)
Abstract
Many problems in Natural Language Processing (NLP) involve an efficient search for the best derivation over (exponentially) many candidates. For example, a parser aims to find the best syntactic tree for a given sentence among all derivations under a grammar, and a machine translation (MT) decoder explores the space of all possible translations of the source-language sentence. In these cases, the concept of a packed forest provides a compact representation of huge search spaces by sharing common sub-derivations, on which efficient algorithms based on Dynamic Programming (DP) are possible. Building upon the hypergraph formulation of forests and well-known 1-best DP algorithms, this dissertation develops fast and exact k-best DP algorithms on forests, which are orders of magnitude faster than previously used methods on state-of-the-art parsers. We also show empirically how the improved output of our algorithms has the potential to improve results from parse reranking systems and other applications. We then extend these algorithms to approximate search when the forests are too big for exact inference. We discuss two particular instances of this new method: forest rescoring for MT decoding and forest reranking for parsing. In both cases, our methods perform orders of magnitude faster than conventional approaches. In the latter, faster search also leads to better learning, where our approximate decoding makes whole-Treebank discriminative training practical and results in an accuracy better than that of any previously reported system trained on the Treebank. Finally, we apply the above material to the problem of syntax-based translation and propose a new paradigm, forest-based translation. This scheme translates a packed forest of the source sentence into a target sentence, rather than just using 1-best or k-best parses as is usual practice. By considering exponentially many alternatives, it alleviates the propagation of parsing errors into translation, yet incurs only a fractional overhead in running time. We also push this direction further to extract translation rules from packed ...
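To make the idea of a packed forest concrete, the following is a minimal sketch, not the dissertation's own code. It assumes a toy encoding in which each node maps to a list of incoming hyperedges `(weight, tail_nodes)`, scores multiply along a derivation, and the hypergraph is acyclic. The node names, weights, and the function names `count_derivations` and `viterbi` are all hypothetical. It shows only the well-known 1-best DP that the abstract builds on, not the k-best algorithms.

```python
# Toy packed forest (hypothetical data): node -> list of (weight, tail nodes).
# A leaf hyperedge has no tails. Node "B" is stored once but is shared by
# both "AB" and "BC", which is what keeps the representation compact.
FOREST = {
    "A": [(0.5, ())],
    "B": [(0.4, ()), (0.3, ())],
    "C": [(0.9, ())],
    "AB": [(0.6, ("A", "B"))],
    "BC": [(0.7, ("B", "C"))],
    "S": [(0.8, ("AB", "C")), (0.5, ("A", "BC"))],
}


def count_derivations(forest, root):
    """Count the derivations packed under root (sum over hyperedges of the
    product of the counts of their tails), memoized per node."""
    memo = {}

    def visit(node):
        if node not in memo:
            total = 0
            for _weight, tails in forest[node]:
                n = 1
                for tail in tails:
                    n *= visit(tail)
                total += n
            memo[node] = total
        return memo[node]

    return visit(root)


def viterbi(forest, root):
    """Return (score, tree) for the highest-scoring derivation under root.
    A tree is (node, tuple_of_child_trees); each node is solved once."""
    memo = {}

    def visit(node):
        if node not in memo:
            best = None
            for weight, tails in forest[node]:
                score = weight
                children = []
                for tail in tails:
                    tail_score, tail_tree = visit(tail)
                    score *= tail_score
                    children.append(tail_tree)
                if best is None or score > best[0]:
                    best = (score, (node, tuple(children)))
            memo[node] = best
        return memo[node]

    return visit(root)


if __name__ == "__main__":
    print(count_derivations(FOREST, "S"))
    print(viterbi(FOREST, "S"))
```

Both functions visit each node once, so their cost grows with the number of hyperedges rather than with the number of derivations, which can be exponentially larger in realistic forests.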
Similar resources
Forest Stand Types Classification Using Tree-Based Algorithms and SPOT-HRG Data
Forest type mapping is one of the most necessary elements in forest management and silviculture treatments. Traditional methods such as field surveys are time-consuming and cost-intensive. Improvements in remote sensing data sources and classification-estimation methods are creating new opportunities for obtaining more accurate maps of forest biophysical attributes. This research co...
A New Text-Mining Method for Extracting User Context Information to Improve the Ranking of Search Engine Results
Today, the importance of text processing and its uses is well known among researchers and students. The amount of textual material increases day by day, so we need useful ways to store it and retrieve information from it. For example, search engines such as Google, Yahoo, and Bing need to read a great many web documents and retrieve the most similar ones to the user ...
A Search in the Forest: Efficient Algorithms for Parsing and Machine Translation based on Packed Forests A DISSERTATION PROPOSAL in Computer and Information Science
Many problems in Natural Language Processing (NLP) involve an efficient search for the best derivation over (exponentially) many candidates. For example, a parser aims to find the best syntactic tree for a given sentence among all derivations under a grammar, and a machine translation (MT) decoder explores the space of all possible translations of the source-language sentence. In these cases, ...
ADABOOST ENSEMBLE ALGORITHMS FOR BREAST CANCER CLASSIFICATION
With advances in technology, different tumor features have been collected for Breast Cancer (BC) diagnosis. Processing such large data sets poses challenges, including high storage requirements and the time required for access and processing. The objective of this paper is to classify BC based on the extracted tumor features. To extract useful information and diagnose the tumo...
A Hybrid Optimization Algorithm for Learning Deep Models
Deep learning is a subset of machine learning that is widely used in Artificial Intelligence (AI) fields such as natural language processing and machine vision. The learning algorithms require optimization in multiple aspects. Generally, model-based inference requires solving an optimization problem. In deep learning, the most important problem that can be solved by optimization is neural n...
Publication date: 2008